In [1]:
#Step 1.1 - Check the Spark version
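This cell is left for you to fill in. A minimal sketch, assuming the notebook provides a pre-created SparkContext named sc (as used throughout this lab):
# Illustrative only: the version attribute of the SparkContext returns the Spark version string
sc.version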
In [2]:
#Step 2.1 - Create RDD of Numbers 1-10
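The body of this cell is not shown; here is a minimal sketch, assuming the RDD name x_nbr_rdd used in the following steps:
# Illustrative sketch: build a Python list with the numbers 1-10 and parallelize it into an RDD
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x_nbr_rdd = sc.parallelize(x)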
Extract the first element of the RDD.
Type:
x_nbr_rdd.first()
In [3]:
#Step 2.2 - Extract first line
Now extract the first 5 elements of the RDD.
Type:
x_nbr_rdd.take(5)
In [4]:
#Step 2.3 - Extract first 5 lines
Perform your first map transformation and replace each element X in the RDD with X+1.
Type:
x_nbr_rdd_2 = x_nbr_rdd.map(lambda x: x+1)
In [5]:
#Step 2.4 - Perform your first map transformation. Replace each element X in the RDD with X+1.
#Remember that RDDs are IMMUTABLE, so it is not possible to UPDATE an RDD. You need to create
#a NEW RDD
Take a look at the elements of the new RDD.
Type:
x_nbr_rdd_2.collect()
In [6]:
#Step 2.5 - Check out the elements of the new RDD. Warning: Be careful with this in real life !! As you
#will be bringing all elements of the RDD (from all partitions) to the driver...
Let's now create a new RDD with one string "Hello Spark" and take a look at it.
Type:
y = ["Hello Spark!"]
y_str_rdd = sc.parallelize(y)
y_str_rdd.first()
In [7]:
#Step 2.6 - Create String RDD, Extract first line
Let's now create a third RDD with several strings.
Type:
z = ["First,Line", "Second,Line", "and,Third,Line"]
z_str_rdd = sc.parallelize(z)
z_str_rdd.first()
In [8]:
#Step 2.7 - Create String RDD with many lines / entries, Extract first line
Count the number of entries in this RDD.
Type:
z_str_rdd.count()
In [9]:
#Step 2.8 - Count the number of entries in the RDD
Take a look at the elements of this RDD.
Type:
z_str_rdd.collect()
In [10]:
#Step 2.9 - Show all the entries in the RDD. Warning: Be careful with this in real life !!
#As you will be bringing all elements of the RDD (from all partitions) to the driver...
In the next step, we will split all the entries in the RDD on the comma delimiter ",".
Type:
z_str_rdd_split = z_str_rdd.map(lambda line: line.split(","))
z_str_rdd_split.collect()
In [11]:
#Step 2.10 - Perform a map transformation to split all entries in the RDD on the commas ",".
#Check out the entries in the new RDD
#Notice how the entries in the new RDD are now ARRAYs with elements, where the original
#strings have been split using the comma delimiter.
In this step, we will learn a new transformation besides map: flatMap.
flatMap will "flatten" all the elements of an RDD entry into its subcomponents.
This is best explained with an example.
Type:
z_str_rdd_split_flatmap = z_str_rdd.flatMap(lambda line: line.split(","))
z_str_rdd_split_flatmap.collect()
In [12]:
#Step 2.11 - Learn the difference between two transformations: map and flatMap.
#Go back to the RDD z_str_rdd_split defined above using a map transformation from z_str_rdd
#and use this time a flatmap.
#What do you notice ? How is z_str_rdd_split_flatmap different from z_str_rdd_split ?
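For reference, the two RDDs should compare roughly as follows (illustrative output, not captured from a run):
# z_str_rdd_split.collect()         -> [['First', 'Line'], ['Second', 'Line'], ['and', 'Third', 'Line']]
# z_str_rdd_split_flatmap.collect() -> ['First', 'Line', 'Second', 'Line', 'and', 'Third', 'Line']
# map produces one output entry per input entry (here, a list of tokens per line),
# while flatMap flattens those lists into a single RDD of individual tokens.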
In this step, we will augment each entry in the previous RDD with the number "1" to create pairs (or tuples). The first element of the tuple will be the keyword and the second element of the tuple will be the digit "1".
This is a common technique used to count elements using Spark.
Type:
countWords = z_str_rdd_split_flatmap.map(lambda word:(word,1))
countWords.collect()
In [13]:
#Step 2.12 - Augment each token with the digit "1" to create a pair RDD of (word, 1) tuples.
#This pair RDD will be aggregated in the next step to count the occurrences of each token.
What we now have above is known as a PAIR RDD. Each entry in the RDD has a KEY and a VALUE.
The KEY is the word (First, Line, etc.) and the VALUE is the number "1".
We can now AGGREGATE this RDD by summing up all the values BY KEY.
Type:
from operator import add
countWords2 = countWords.reduceByKey(add)
countWords2.collect()
In [14]:
#Step 2.13 - Check out the results of the aggregation
#You just created an RDD countWords2 which contains the counts for each token...
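For reference, the aggregation should produce counts along these lines (the order of the pairs may vary across runs and partitions):
# countWords2.collect() -> e.g. [('and', 1), ('First', 1), ('Line', 3), ('Second', 1), ('Third', 1)]
# 'Line' appears three times in z; every other token appears once.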
Step 3 - Pull in the Spark README.md file,
Convert the file to an RDD,
Count the number of lines with the word "Spark" in them.
Type:
!rm README.md* -f
!wget https://raw.githubusercontent.com/carloapp2/SparkPOT/master/README.md
In [15]:
#Step 3.1 - Pull data file into workbench
Now we will point Spark to the text file stored in the local filesystem and use the "textFile" method to create an RDD named "textfile_rdd", which will contain one entry for each line in the original text file.
We will also count the number of lines in the RDD (which is also the number of lines in the text file).
Type:
textfile_rdd = sc.textFile("README.md")
textfile_rdd.count()
In [16]:
#Step 3.2 - Create RDD from data file
Let us now filter the RDD and keep only the entries that contain the token "Spark". This will be achieved using the "filter" transformation, combined with the Python syntax for testing whether a particular substring is present within a larger string: substring in string.
We will also take a look at the first line in the newly filtered RDD.
Type:
Spark_lines = textfile_rdd.filter(lambda line: "Spark" in line)
Spark_lines.first()
In [17]:
#Step 3.3 - Filter for only lines with word Spark
We will now count the number of entries in this filtered RDD and present the result as a concatenated string.
Type:
print "The file README.md has " + str(Spark_lines.count()) + \
" of " + str(textfile_rdd.count()) + \
" Lines with the word Spark in it."
In [18]:
#Step 3.4 - count the number of lines
Using your knowledge from the previous exercises, you will now count the number of times the substring "Spark" appears in the original text.
Instructions:
Looking back at previous exercises, you will need to:
1- Execute a flatMap transformation on the original RDD Spark_lines and split on white space.
2- Augment each token with the digit "1", so as to obtain a PAIR RDD where the first element of the tuple is the token and the second element is the digit "1".
3- Execute a reduceByKey with the add operator to count the number of instances of each token.
4- Filter the RDD resulting from step 3 above to only keep the entries which start with "Spark".
In Python, the syntax to decide whether a string starts with a token is string.startswith("token").
5- Display the resulting list of tokens which start with "Spark".
In [19]:
#Step 3.5 - count the number of instances of tokens starting with "Spark"
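One possible solution sketch, reusing the Spark_lines RDD from Step 3.3 (the variable name spark_token_counts below is an illustrative choice, not part of the lab):
from operator import add
# Illustrative sketch: split each matching line on whitespace, pair each token with 1,
# sum the counts per token, then keep only the tokens that start with "Spark"
spark_token_counts = Spark_lines.flatMap(lambda line: line.split()) \
                                .map(lambda word: (word, 1)) \
                                .reduceByKey(add) \
                                .filter(lambda pair: pair[0].startswith("Spark"))
spark_token_counts.collect()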
As a slight modification of the cell above, let us now filter out and display the tokens which contain the substring "Spark" (instead of only those which START with it). Your result should be a superset of the previous result.
The Python syntax to determine whether a string contains a particular "token" is: "token" in string
In [20]:
#Step 3.6 - Display the tokens which contain the substring "Spark" in them.
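A minimal variation of the previous sketch, this time keeping any token that contains "Spark" (again, the variable name is illustrative, and add is assumed to be imported from the operator module as before):
# Illustrative sketch: same pipeline, but the final filter keeps tokens containing "Spark" anywhere
spark_token_counts_contain = Spark_lines.flatMap(lambda line: line.split()) \
                                        .map(lambda word: (word, 1)) \
                                        .reduceByKey(add) \
                                        .filter(lambda pair: "Spark" in pair[0])
spark_token_counts_contain.collect()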
We have a sample file with instructors and scores. In this exercise we want you to add up all the scores and report on the results by following these steps:
1- The name of the file is "Scores.txt". Delete it from the local filesystem if it exists.
2- Download the file from the provided location (see below).
3- Load the text file into an RDD of instructor names and instructor scores.
4- Execute a transformation which will keep the instructors' names, but will add up the 4 numbers representing the scores per instructor, resulting in a new RDD
5- Display the instructor's name and the total score for each instructor
6- Execute a second transformation to compute the average score for each instructor and display the results.
7- Who was the top performer?
The Data File has the following format: Instructor Name,Score1,Score2,Score3,Score4
Here is an example line from the text file: "Carlo,5,3,3,4"
Data File Location: https://raw.githubusercontent.com/carloapp2/SparkPOT/master/Scores.txt
In [21]:
#Step 4.1 - Delete the file if it exists, download a new copy and load it into an RDD
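A possible sketch for this cell, following the same pattern as Step 3.1 (the RDD name scores_rdd is an illustrative choice):
!rm Scores.txt* -f
!wget https://raw.githubusercontent.com/carloapp2/SparkPOT/master/Scores.txt
# Load the downloaded file into an RDD with one entry per line ("Name,Score1,Score2,Score3,Score4")
scores_rdd = sc.textFile("Scores.txt")
scores_rdd.collect()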
In [22]:
#Step 4.2 - Execute the necessary transformation(s) to extract the instructor's name, as well
# as the instructor's scores, then add up the scores per instructor and display the results
# in the form of a new RDD with the elements: "Instructor Name", InstructorTotals
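One way to approach this, continuing from the scores_rdd sketch in Step 4.1 (names are illustrative):
# Illustrative sketch: split each line on the commas, keep the instructor name as the key
# and the sum of the four scores (converted from strings to floats) as the value
instructor_totals = scores_rdd.map(lambda line: line.split(",")) \
                              .map(lambda fields: (fields[0], sum([float(s) for s in fields[1:]])))
instructor_totals.collect()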
In [23]:
#Step 4.3 - Execute additional transformation(s) to compute the average score per instructor.
# Display the resulting averages for all instructors.
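A minimal follow-up sketch, assuming each instructor has exactly 4 scores as described in the data format above:
# Illustrative sketch: divide each instructor's total by the number of scores (4) to get the average
instructor_averages = instructor_totals.map(lambda pair: (pair[0], pair[1] / 4.0))
instructor_averages.collect()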
In [ ]: